ABSTRACT

ProGlycProt (http://www.proglycprot.org/) is an 
open access, manually curated, comprehensive re- 
pository of bacterial and archaeal glycoproteins 
with at least one experimentally validated glycosite 
(glycosylated residue). To bring maximum information
together in one place, the database is arranged under
two sections: (i) ProCGP, the main data section
consisting of 95 entries with experimentally
characterized glycosites and (ii) ProUGP, a supplementary
data section containing 245 entries with experimentally
identified glycosylation but uncharacterized
glycosites. Every entry in the database is fully
cross-referenced and enriched with available pub- 
lished information about source organism, coding 
gene, protein, glycosites, glycosylation type, at- 
tached glycan, associated oligosaccharyl/glycosyl 
transferases (OSTs/GTs), supporting references, 
and applicable additional information. Interestingly, 
ProGlycProt contains as many as 174 entries for 
which information is unavailable or the character- 
ized glycosites are unannotated in Swiss-Prot 
release 2011_07. The website supports a dedicated 
structure gallery of homology models and crystal 
structures of characterized glycoproteins in addition 
to two new tools developed in view of emerging in- 
formation about prokaryotic sequons (conserved 
sequences of amino acids around glycosites) that 
are never or rarely seen in eukaryotic glycoproteins. 
ProGlycProt provides an extensive compilation of 
experimentally identified glycosites (334) and glyco- 
proteins (340) of prokaryotes that could serve as an 
information resource for research and technology 
applications in glycobiology. 



INTRODUCTION 

Protein glycosylation in prokaryotes is a recent but rapidly 
growing area of research. An expanding repertoire of
prokaryotic glycoproteins is increasingly being explored for
applications in diagnostics (1) and vaccines (2), as future
nano-machines built from proteins like S-layer
glycoproteins (3) and as a strategy to improve industrially
important enzymes for specific attributes (4,5).

The prokaryotes indeed synthesize a wide variety of 
glycans linked covalently to their proteins, commonly at 
the amide group of Asn (N-linked), hydroxyl group of 
Ser/Thr/Tyr (O-linked) and rarely at the sulphur atom
of Cys (S-linked) (6). Equally, they display a diversity in 
the mechanisms of glycosylation that include well-known, 
en bloc N-glycan transfer (Archaea & Campylobacter spp.) 
and sequential O-glycan transfer (Pseudomonas spp., 
Campylobacter spp. etc.) as well as novel, en bloc 
O-glycan transfer (Neisseria spp.) and sequential 
N-glycan transfer (Haemophilus influenzae) (7,8). 
This diversity has led to the identification and
characterization of several new protein glycosylation-associated
enzymes, OSTs and GTs in prokaryotes (7,8). Likewise, 
hundreds of new glycoproteins have now been identified 
experimentally, across all major phyla of bacteria and 
archaea (Supplementary Figure S1), implicating them in
diverse biological functions in the cellular and extracellular
milieu (9). To name a few, the Apa protein of the human pathogen
Mycobacterium tuberculosis (10), flaA of the phytopathogen
Acidovorax avenae K1 H8301 (11), the glycosylated pilin
protein of Neisseria gonorrhoeae (12) and the adhesins of
several pathogenic bacterial species are examples of
glycoproteins that are involved in crucial host-pathogen 
interactions, modulation of the host immune system and 
virulence of the pathogenic bacterial species. Interestingly, 
in the last decade, as many as 67 new glycoproteins have 
been characterized for their glycosites in prokaryotes 
(Figure 1). Around the same time, many reviews and
research articles have appeared in reputed scientific
journals containing focused compilations of known
information about these glycoproteins (3,9,13-16,
http://www.proglycprot.org/recent_review.aspx). The rise in
interest in the glycoproteins and glycobiology of prokaryotes
is obvious. However, there is currently no specialized
resource that provides information on prokaryotic
glycoproteins in a comprehensive manner. A dedicated
resource for prokaryotic glycoproteins, analogous to
O-GLYCBASE (a collection of O- and C-glycosylated
proteins of eukaryotes, 17), will also complement ongoing
glycoprotein annotation efforts such as those at Swiss-Prot (18)
and dbPTM, which integrates experimentally
validated information on post-translational modifications
(19). Further, with the availability of high-throughput
techniques like mass spectrometry and lectin arrays, and of
emerging data analysis tools, a large influx of data on
prokaryotic glycoproteins is anticipated.

Figure 1. Trends of experimental research on prokaryotic
glycoproteins in the last 35 years, as derived from the
ProGlycProt database.

In view of this necessity, and to cater to general interests 
in the science of prokaryotic glycoproteins, we have de- 
veloped ProGlycProt as a manually curated, comprehen- 
sive repository of published information on bacterial and 
archaeal glycoproteins with at least one experimentally 
characterized glycosite. It is a modest but focused
beginning of an effort to provide enough experimental
information in one place to glean insights into the relationship
between a glycoprotein, its OSTs/GTs, protein
glycosylation-linked gene(s) and their genomic context.

In this database, a characterized glycoprotein is one
in which at least one glycosite is validated through
experiments like Edman degradation, mass spectrometry or
site-directed mutagenesis. Similarly, an uncharacterized
glycoprotein is one in which glycosylation, but not the
glycosite(s), is identified by one or more experimental
methods, e.g. aberrant migration on SDS-PAGE,
sugar-specific staining, lectin binding, etc.
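
The two definitions above amount to a simple decision rule, which can be sketched as follows. This is only an illustration of the stated criteria, not the curation procedure used for the database: the method labels, the function name `classify_entry` and the extra `unsupported` outcome are assumptions introduced here, and the method lists in the text are given as examples rather than as a closed set.

```python
# Hypothetical method labels; the database may word these differently.
SITE_DEFINING_METHODS = {
    "edman degradation",
    "mass spectrometry",
    "site-directed mutagenesis",
}
GLYCOSYLATION_ONLY_METHODS = {
    "aberrant sds-page migration",
    "sugar-specific staining",
    "lectin binding",
}


def classify_entry(methods, validated_glycosites):
    """Classify an entry from its evidence.

    methods: iterable of experimental method names reported for the protein.
    validated_glycosites: collection of experimentally validated sites.
    """
    used = {m.strip().lower() for m in methods}
    # Characterized: at least one glycosite backed by a site-defining method.
    if validated_glycosites and used & SITE_DEFINING_METHODS:
        return "characterized"
    # Uncharacterized: glycosylation shown, but no validated glycosite.
    if used & (SITE_DEFINING_METHODS | GLYCOSYLATION_ONLY_METHODS):
        return "uncharacterized"
    # No listed experimental evidence at all (outcome added for illustration).
    return "unsupported"
```

An entry with a validated site and a mass spectrometry report would fall under ProCGP, whereas one supported only by lectin binding would fall under ProUGP.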



DATA COLLECTION AND CURATION 

The first release of ProGlycProt with 340 entries is a result
of an extensive literature search followed by the manual
curation of the data compiled from a total of 410 research
articles and review papers
(http://www.proglycprot.org/Bibliography.aspx). For ProCGP,
the initial literature collection was built using various
keyword searches made at PubMed (20), Google Scholar and
the Web of Science. Additional references relevant to this
study were retrieved from the citations given in the
aforementioned research and review articles. As a result,
ProCGP now lists 88 native glycoproteins, in addition to
seven proteins and peptides that are glyco-engineered using
in vitro, in vivo and enzymatic or synthetic approaches. ProCGP
represents all three experimentally known protein-glycan 
linkages in prokaryotes, namely N, O and S with infor- 
mation on 132 N-glycosites, 196 O-glycosites and 6 S- 
glycosites (Supplementary Figure S2). Both identical 
(five proteins with 18 glycosites are identical in the current 
database) and homologous sequences are included to 
provide a complete primary list of experimentally charac- 
terized prokaryotic glycoproteins from which a non- 
redundant dataset can be derived easily as required by 
the users. In some cases, a redundant entry may provide 
interesting experimental information. For example, 
ProGlycProt ID AC102 provides information on in vivo
N-glycosylation at the noncanonical sequon NX(N/L/V)
(X≠P) in engineered mutants of a cell surface
glycoprotein (CSG/S-layer glycoprotein derived from AC101) at
position N36 in the full-length protein by a yet unknown
OST in the archaeon Halobacterium salinarum [known as
H. halobium previously (16,21)]. Similarly, identical
entries BC130, 132, 133, 135 and 136 are included as 
each belongs to a different strain. All entries in
ProCGP are first manually corrected to incorporate
mutational changes, sequence conflicts or engineered
sequences, if any, as per the experimental data, and are then
annotated for experimentally verified glycosites. A visual
display of these manually annotated sequences is
available under the subfield titled 'glycosite(s) annotated
protein sequence'. This field is therefore the true identifier
for redundancy estimation in the database. The
glycoprotein entries (21 in number) retrieved initially from
Swiss-Prot to nucleate data-section ProCGP are revised 
as per the updated literature in applicable cases like 
S-layer glycoprotein of H. salinarum, S-layer protein of 
Haloferax volcanii and AIDA auto-transporter protein 
of Escherichia coli. A sequence conflict is addressed for 
HisJ protein of Campylobacter jejuni. Finally, a
cross-check with BCSDB version 3.0 (22), O-GLYCBASE
version 6.0 and Swiss-Prot release 2011_07 suggests
that ProCGP is a comprehensive, exclusive and
currently the largest compilation of characterized
prokaryotic glycoproteins and their glycosites (Supplementary
Tables S1-S3).
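
Since the glycosite-annotated sequence is described above as the identifier for redundancy estimation, a non-redundant dataset could be derived along the following lines. This is a minimal sketch under stated assumptions: the `(entry_id, sequence, glycosites)` tuple layout and the function name `non_redundant` are hypothetical, and the database's actual export format may differ.

```python
def non_redundant(entries):
    """Keep the first entry for each distinct annotated sequence.

    entries: iterable of (entry_id, sequence, glycosite_positions) tuples.
    Two entries count as redundant only when both the sequence and the
    set of annotated glycosite positions are identical.
    """
    first_seen = {}
    for entry_id, sequence, glycosites in entries:
        key = (sequence.upper(), tuple(sorted(glycosites)))
        # setdefault leaves the earlier entry in place for repeated keys.
        first_seen.setdefault(key, entry_id)
    # Dicts preserve insertion order, so the input order is retained.
    return list(first_seen.values())
```

Keeping the first occurrence is an arbitrary choice made here for simplicity; a user interested in strain-specific information may prefer to keep all identical entries and only group them.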

In parallel, uncharacterized prokaryotic glycoprotein
entries were catalogued under the data section
ProUGP from independent reviews and research articles
published in various journals, as mentioned in the
Introduction. Notably, ProUGP contains at least
107 experimentally identified glycoprotein entries from
prokaryotes (with unsequenced genomes) that are not
available in Swiss-Prot release 2011_07.

DATA ARRANGEMENT AND ACCESS

ProGlycProt is developed by integrating the data in
MSSQL, a relational database management
system (RDBMS) that works at the backend; the
web interface was built in ASP.NET 2.0 with C#, HTML,
JavaScript and CSS.

The complete data are arranged under the menu
ProGlycProtdb in 10 broad fields that are further
split into 47 subfields, of which 18 are content fields
and 29 provide cross-references/links facilitating easy
access to existing information (Figure 2). The broad fields
and subfields contain information for an entry as defined
below:

ProGlycProt ID: a unique ProGlycProt ID that
starts with the series AC to indicate an archaeal
characterized glycoprotein and BC to indicate a bacterial
characterized glycoprotein. Similarly, the AU and BU series
indicate archaeal and bacterial uncharacterized
glycoproteins, respectively.

Organism information: contains general information 
about the source archaeal or bacterial species/strain. 

Genome sequences: provides links to the available
genome sequences and additional information such as a note
on the pathogenicity of the source bacterial species/strain.



Gene information: lists general information about the
coding gene with relevant links.

Protein information: lists the name and other general
information about the experimentally identified
glycoprotein with relevant links.

Protein structure: provides available crystal structures or
homology models with related links.

Glycosylation status: contains relevant links and infor- 
mation derived mainly from the literature about experi- 
mentally identified glycosites, type of glycosylation, 
experimental methods used to detect and define 
glycosites, a glycosite sequence logo and functional im- 
plication of glycosylation. 

Glycan information: provides the linear glycan structure
(usually in standard IUPAC linear notation), the
corresponding BCSDB ID link and the method of
characterization of the glycan.

Protein glycosylation-linked gene(s): provides
information about related, experimentally validated and
predicted OSTs/GTs and relevant links.

Literature: a tabulated bibliography is given, together with
interesting additional information that could not be placed
under the aforesaid fields, for example, whether a protein is
glyco-engineered or native, information about the foreign OST
used to glycosylate a protein of a given organism, sequon
features, etc.

Figure 2. ProGlycProt data arrangement/retrieval schema.
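
The ID convention and the ten broad fields defined above can be pictured as a simple record. The sketch below is illustrative only: it is not the MSSQL schema behind ProGlycProt, the attribute names are paraphrases of the field titles, the use of plain dictionaries for subfields is an assumption, and the rule that the prefix is followed by digits only is inferred from IDs such as AC102 and BC130 quoted in the text.

```python
import re
from dataclasses import dataclass, field

# Prefix letters as defined in the text; a digits-only suffix is an assumption.
_ID_PATTERN = re.compile(r"([AB])([CU])(\d+)")
_DOMAIN = {"A": "archaeal", "B": "bacterial"}
_STATUS = {"C": "characterized", "U": "uncharacterized"}


def parse_proglycprot_id(entry_id):
    """Split an ID such as 'AC102' into (domain, status, number)."""
    match = _ID_PATTERN.fullmatch(entry_id.replace(" ", "").upper())
    if match is None:
        raise ValueError(f"not a ProGlycProt-style ID: {entry_id!r}")
    domain, status, number = match.groups()
    return _DOMAIN[domain], _STATUS[status], int(number)


@dataclass
class Entry:
    """One record with the ten broad fields; subfields are kept as dicts."""
    proglycprot_id: str
    organism_information: dict = field(default_factory=dict)
    genome_sequences: dict = field(default_factory=dict)
    gene_information: dict = field(default_factory=dict)
    protein_information: dict = field(default_factory=dict)
    protein_structure: dict = field(default_factory=dict)
    glycosylation_status: dict = field(default_factory=dict)
    glycan_information: dict = field(default_factory=dict)
    glycosylation_linked_genes: dict = field(default_factory=dict)
    literature: dict = field(default_factory=dict)

    @property
    def is_characterized(self):
        # True for the AC/BC series, False for the AU/BU series.
        return parse_proglycprot_id(self.proglycprot_id)[1] == "characterized"
```

For instance, `parse_proglycprot_id("AC102")` yields `("archaeal", "characterized", 102)`, which is enough to route an entry to ProCGP or ProUGP.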

ProGlycProt is searchable by and for multiple parameters. 
A typical search result display (Supplementary Figure S3)
and a detailed note on data access are available as
supplementary information.

TOOLS 

A part of the literature in ProCGP, as discussed below, 
defines novel and potential sequon features in different 
bacterial glycoproteins belonging to different species. 
Some of these sequons are unique to prokaryotes. In the 
same context, there is a growing concern that existing 
glycosite prediction tools (as listed at
http://www.proglycprot.org/related_tools_database.aspx) might
not be sufficient or suitable for the best analysis of prokaryotic
glycoproteins (8). Interestingly, in a recent study by
Comstock's group, as many as eight new proteins of
Bacteroides fragilis were characterized as glycoproteins
upon identification of the sequon (D)(S/T)(A/I/L/V/M/T)
in the corresponding sequences of the bacterial proteome
(23). The same group had validated this
sequon experimentally while characterizing the first
Bacteroides glycoprotein, BF2494 (24). Encouraged by
this, we have developed the tools Map Sequon
(http://www.progpdb.org/Mapsequon.aspx) and Glyseq Extractor
(http://www.progpdb.org/glyseq_extractor.aspx), which we
believe can be of great help in making a first estimate
of putative glycoproteins in prokaryotes, especially when
one has to deal with proteome-scale data. Map Sequon
provides a visual display of, and information about, the
presence, spread or clustering of specified sequons in the input
protein sequence(s). Similarly, Glyseq Extractor helps in
retrieving defined sequence lengths around a sequon for
statistical analysis of the glycosites. Based purely on
insights from the published literature, irrespective of their
statistical significance, the following sequons found in
native glycoproteins have been included in one or both of the
tools:

Typical in eukaryotes, the NX(S/T) (X≠P) sequon is
required for N-glycosylation in glycoproteins of
Gammaproteobacteria [HMW1 protein of H. influenzae
(25)] as well as in almost all archaeal species (16). A
recent characterization of the PglB homolog of the
Deltaproteobacterium Desulfovibrio desulfuricans also suggests a
preference for the NX(S/T) sequon (26).

On the other hand, N-glycosylation at the (D/E)X1NX2(S/T)
(X1 and X2 ≠ P) sequon has almost always been found to be
mediated by the PglB protein (OST) of Campylobacter
species and, recently, of Helicobacter pullorum, all of
which belong to the class Epsilonproteobacteria (27,28).
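
Reading the two N-glycosylation sequons as NX(S/T) with X other than Pro, and (D/E)X1NX2(S/T) with neither X1 nor X2 being Pro, both can be located in a sequence with regular expressions. The following is a minimal sketch of that idea and not the implementation of Map Sequon; the function name `map_n_sequons` and the toy sequence in the usage note are assumptions made for illustration.

```python
import re

# (label, lookahead pattern, offset from the match start to the Asn).
# The lookahead form reports overlapping occurrences as well.
N_SEQUONS = [
    ("NX(S/T)", re.compile(r"(?=N[^P][ST])"), 0),
    ("(D/E)X1NX2(S/T)", re.compile(r"(?=[DE][^P]N[^P][ST])"), 2),
]


def map_n_sequons(sequence):
    """Return, per sequon, the 1-based positions of the candidate Asn."""
    seq = sequence.upper()
    return {
        label: [m.start() + offset + 1 for m in pattern.finditer(seq)]
        for label, pattern, offset in N_SEQUONS
    }
```

On the toy sequence `MDQNATKNPSANGT`, the short sequon is found at Asn 4 and Asn 12 (Asn 8 is skipped because it is followed by Pro), while the extended sequon is found only at Asn 4. Such hits are candidates only; a sequon match does not by itself show that a site is glycosylated.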

With currently available data, the sequon (D)(S/T)(A/I/L/
V/M/T) should be considered an O-glycosylation
feature exclusive to the phylum Bacteroidetes. The sequon
has an aspartate (D) preceding the glycosylated T or S,
which is followed by an amino acid with one or more
methyl groups (24). The presence of this sequon has
been observed consistently in glycoproteins of various
members of this phylum belonging to three different
classes, namely Flavobacteria, Sphingobacteria and
Bacteroidia. As one exception, Chondroitinase-B of
Pedobacter heparinus lacks a methyl group-containing
amino acid at the +1 position of the actual glycosylated
sequon DSN (29), suggesting DS as a possible independent
sequon feature, which is supported in previous literature as
well (30).
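
The distinction drawn above between the full sequon (D)(S/T)(A/I/L/V/M/T) and the shorter DS feature can be made explicit when scanning a sequence. The sketch below is an assumption-laden illustration rather than the logic of the published tools: the function name and the decision to report DS-only hits separately are choices made here.

```python
import re

FULL_SEQUON = re.compile(r"(?=D[ST][AILVMT])")  # D, then S/T, then a methyl-bearing residue
SHORT_SEQUON = re.compile(r"(?=DS)")            # relaxed feature, as in the DSN case


def map_bacteroidetes_sequons(sequence):
    """Return (full_hits, short_only_hits) as 1-based Ser/Thr positions."""
    seq = sequence.upper()
    # The candidate glycosite is the residue right after the aspartate.
    full = {m.start() + 2 for m in FULL_SEQUON.finditer(seq)}
    short = {m.start() + 2 for m in SHORT_SEQUON.finditer(seq)}
    # Report DS hits only where the full sequon did not already match.
    return sorted(full), sorted(short - full)
```

For the toy sequence `ADTLKDSNGDSA`, this reports positions 3 and 11 for the full sequon and position 7 as a DS-only hit, mirroring the DSN exception described for Chondroitinase-B.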

Similarly, glycosylation at tyrosine (Y) that is always
preceded by valine (V) has been observed at all four
sites of the S-layer glycoprotein of Thermoanaerobacter kivui
[original name Acetogenium kivui, phylum Firmicutes
(31)], the first available characterized glycoprotein with
O-glycosylation at tyrosine. Therefore, we found it
important to include DS as well as VY in our tool(s) to
provide maximum coverage of possible sequons in
prokaryotic glycoproteins.
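
Retrieving a fixed-length window around each sequon, as Glyseq Extractor is described as doing, can be sketched as follows. This is not the tool's implementation: the function name `extract_windows`, its parameters and the use of '-' to pad windows that run past a terminus are assumptions made for this example.

```python
import re


def extract_windows(sequence, pattern, site_offset, flank):
    """Collect sequence windows centred on candidate glycosites.

    pattern: regular expression for the sequon, e.g. "VY" or "DS".
    site_offset: 0-based offset of the glycosylated residue in the match.
    flank: number of residues to keep on each side of the site.
    Returns a list of (1-based site position, window) pairs.
    """
    seq = sequence.upper()
    windows = []
    for m in re.finditer(f"(?={pattern})", seq):
        site = m.start() + site_offset
        # Pad with '-' so every window has the same length near the termini.
        left = seq[max(0, site - flank):site].rjust(flank, "-")
        right = seq[site + 1:site + 1 + flank].ljust(flank, "-")
        windows.append((site + 1, left + seq[site] + right))
    return windows
```

Calling `extract_windows("MKVYAGVYT", "VY", 1, 3)` returns `[(4, "MKVYAGV"), (8, "AGVYT--")]`. Equal-length windows of this kind can be stacked directly for residue-frequency counts or a sequence logo.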

The other common features observed around glycosites
of O-glycosylated proteins of bacteria are an S/T
low-complexity region in a flexible loop of the protein, as in
the case of N. gonorrhoeae (32), and eukaryotic mucin-type
Pro-, Ala-, Thr- and Ser-rich domains in Actinobacteria (33).

An additional tool, BLAST (34), provides easy
retrieval of information through a sequence similarity search
against ProGlycProt. All these applications are accessible
from the ProGlycProt website under the menu Tools.

WEB INTERFACE AND ADDITIONAL FEATURES 

Free access to the ProGlycProt database, tools and other
features is available at http://www.proglycprot.org/. The
curated data files, applications and additional features
are arranged under four independent pull-down menus:
ProGlycProtdb, Structure Gallery, Tools and Links. The
browsing-enabled database statistics, our contact details
and a submission form for a new glycoprotein entry are
available from the home page. Quick help is provided
in the form of brief explanatory notes at the top of every
page, explanatory text beneath various buttons, an example
display page and a detailed help section consisting of
relevant FAQs, a glossary of terms and a downloadable
tutorial on how to use ProGlycProt. The Structure Gallery
independently allows easy retrieval of crystal
structures and homology models of characterized
glycoproteins. A list of existing related databases/tools,
a searchable bibliography and a list of relevant recent
reviews are available under Links. The overall database design
and flow of information in ProGlycProt are shown in
Figure 2. More details on data access and print/ 
download options under various menus are available as 
supplementary material. 

CURRENT SCOPE AND FUTURE PERSPECTIVE 

The first release of ProGlycProt provides an extensive
collection of experimentally identified prokaryotic glycosites
(334), glycoproteins (95) and related information to set the
stage for future statistical analysis of prokaryotic
glycosites, neighbouring residues and 3D folds that can then
provide fresh insights into the specificities of related
OSTs and differences in the mechanisms of protein
glycosylation between prokaryotes and eukaryotes. Because
ProGlycProt has a broad taxonomic
coverage (Supplementary Figure S1) and published
evidence of glycosylation for all entries, it provides an
updated and realistic estimate of the extent of occurrence
of protein glycosylation in prokaryotes. To serve a
broader interest in prokaryotic glycoproteins, OSTs and
associated GTs for their potential applications in applied
and basic research (1-5,35), the database provides a variety of
biologically and experimentally relevant information
(Supplementary Tables S1 and S2) about both native and
glyco-engineered proteins of prokaryotes in addition to
their cataloguing. Existing entries are updated in real
time as soon as relevant literature is published or
obtained. Otherwise, the database is generally updated once
every three months. Future versions aim at introducing
in-depth information on prokaryotic OSTs, along with
continued compilation of characterized and
uncharacterized glycoproteins under the respective sections
and enhanced structural/image inputs for glycan entries in
ProCGP.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR online: 
Supplementary Tables 1-3, Supplementary Figures 1-3. 